HDFStore: Fix empty result of keys() method on non-pandas hdf5 file #32401
Conversation
Force-pushed from 3c7feb3 to 22f8d9e
pandas/io/pytables.py
Outdated
@@ -590,16 +590,35 @@ def __enter__(self):
     def __exit__(self, exc_type, exc_value, traceback):
         self.close()

-    def keys(self) -> List[str]:
+    def keys(self, kind: Optional[str] = "pandas") -> List[str]:
Hi,
Thanks very much for putting it together!
Why not make "tables" the default and worry about the performance later? (There is some "tables"/"table" mixup in the docs, by the way.) dask, for example, would have to use the keyword to work properly, which would make it incompatible with any pandas < 1.0.x that doesn't have this keyword.
Given that, I am no longer so sure about solving it here and in this way; it is an API change that will most certainly break "userspace", but I'd leave that to the maintainers.
Making "tables" the default is probably breaking the API, but I can do some experiments by changing it and running the unit tests and see what breaks 😄
So that did not work.
Another suggestion that was previously mentioned is to fall back to the "tables" option if the "pandas" option returns no results. This assumes that we either read an HDF5 file that was produced by pandas and only has pandas-style tables, or an HDF5 file that was produced by another application and only has native HDF5 tables.
I think that this way we stay backwards compatible and we avoid the performance issue with the pandas-style tables.
I have tried this and it does not break existing unit tests in the pytables test set.
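A minimal sketch of that fallback idea, written as a stand-alone function over a PyTables file handle rather than the PR's actual diff (the pandas_type check reflects the metadata pandas writes on its groups; the function name is made up):

from typing import List

import tables  # PyTables


def keys_with_fallback(h5: tables.File) -> List[str]:
    # Prefer groups carrying pandas metadata, i.e. the current keys() behaviour.
    pandas_keys: List[str] = [
        group._v_pathname
        for group in h5.walk_groups()
        if getattr(group._v_attrs, "pandas_type", None) is not None
    ]
    if pandas_keys:
        return pandas_keys
    # Fallback: no pandas-style groups, so list the native HDF5 tables instead.
    return [node._v_pathname for node in h5.walk_nodes("/", classname="Table")]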
Hi,
Again, thanks for taking it on and coming up with a solution.
Looks good to me, so we'll see.
Cheers.
I've just come across this as it's stopping me from working with HDF5 files not produced by pandas.
My issue: how will this cope with files that were produced in some upstream system conforming to the HDF5 specification, and then had data added to them with pandas (so you have a mix)?
Surely the real fix is for pandas (or, more specifically, HDFStore) to conform to the HDF5 specification (so files work with h5py).
pandas has a specific format of meta-data on top of HDF5. This is not likely to change.
assert len(store.keys(kind="tables")) == 3
expected = {"/group/table1", "/group/table2", "/group/table3"}
assert set(store.keys(kind="tables")) == expected
assert set(store) == set()
Shouldn't that be
assert set(store.keys()) == set()
?
This form is equivalent, as the store implements the __iter__ method, which just returns an iterator over keys().
I will change it to be explicit.
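To make the equivalence concrete, here is a toy stand-in (not the real HDFStore) showing how delegating __iter__ to keys() makes the two assertions interchangeable:

class ToyStore:
    def __init__(self, keys_list):
        self._keys = list(keys_list)

    def keys(self):
        return list(self._keys)

    def __iter__(self):
        # Iteration is defined in terms of keys(), so set(store) and
        # set(store.keys()) always compare equal.
        return iter(self.keys())


store = ToyStore([])
assert set(store) == set(store.keys()) == set()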
Thanks for the explanation, that makes sense indeed.
Force-pushed from 839db77 to 7adcafb
@@ -593,13 +593,20 @@ def __exit__(self, exc_type, exc_value, traceback):
     def keys(self) -> List[str]:
         """
         Return a list of keys corresponding to objects stored in HDFStore.
+        If the store contains pandas native tables, it will return their names.
This is not a good API. Either:
a) always return all tables (API breaking), or
b) add a keyword include='pandas'|'native'|'all', with 'pandas' as the default.
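A rough sketch of what option (b) could look like; _pandas_keys and _native_table_keys are hypothetical helpers standing in for the listing logic discussed earlier, not pandas internals:

from typing import List


def keys(self, include: str = "pandas") -> List[str]:
    # Dispatch on the proposed keyword; the helpers below are hypothetical.
    if include == "pandas":
        return self._pandas_keys()
    if include == "native":
        return self._native_table_keys()
    if include == "all":
        return self._pandas_keys() + self._native_table_keys()
    raise ValueError(
        f"`include` should be 'pandas', 'native' or 'all', got {include!r}"
    )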
Note I am not the PR author but reported the issue.
It really depends on what you want to achieve. IMHO better compatibility would be achieved with option (a), which basically reverts #21543 but only for keys(), leaving groups() untouched. I don't know if that breaks anything else, and it might hurt performance in some cases, although the original bug #21372 complained about the performance of groups() (it links to #17593 about slow keys()). There is also a walk() method which is very similar to groups().
(b) would break backwards compatibility of all (future) projects that use that keyword wrt pandas < 1.0.x.
The latest version is trying to keep some backwards compatibility with pandas 0.23 (plus or minus a minor revision). Apparently there are some tools that depend on the original behavior (hence the issue which this PR is trying to fix). So either we keep breaking backward compatibility or we add a keyword.
I think this is a matter of taste/policy.
@jreback I think this is a trade-off between behavior that is backwards compatible with v0.23.x and a clean API.
I see one vote from @st-bender for being backwards compatible, and one vote from @jreback for the clean API.
Not sure whose vote has more weight (:smile: but I can make a guess), but I have no strong feelings either way.
Thanks @roberthdevries for pushing this forward, and @jreback thanks for the feedback.
As I see it there is no fully compatible way to change that behaviour back to <=0.23.x, and apparently I am the only one who noticed it, when dask.dataframe.read_hdf() suddenly failed to read some files which worked before (dask/dask#5934).
It seems that dask.dataframe is currently the only serious "user" of that interface, and I filed dask/dask#5934 to decide whether this is better handled within dask instead of pandas. The change there would be similarly simple and would probably be more backwards compatible with older pandas versions.
Force-pushed from 91eb111 to 9fd51cb
Is this better handled upstream in dask? I am OK with a keyword to .keys but cannot change the behavior like this.
An additional kind parameter has been added that defaults to pandas' original behavior, but with the 'tables' value gives you the list of non-pandas tables in the file
- improve type annotation of the HDFStore.keys method
- minor improvement in ValueError string
- minor improvement in doc-string
- first try new behavior: if no result, return native HDF5 tables
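In usage terms, the behaviour these commits describe would be along these lines (the kind= keyword exists only on this PR's branch, and the file name is made up):

import pandas as pd

with pd.HDFStore("mixed_or_external.h5", mode="r") as store:
    print(store.keys())               # pandas-style keys (default behaviour)
    print(store.keys(kind="tables"))  # native, non-pandas HDF5 tables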
Force-pushed from 9fd51cb to a903c6e
I'll try to find out; there was not a lot of response so far, so I'll ping and ask.
I understand, and I am with you wrt not changing the API behaviour.
I am not sure it is worthwhile to change anything here. Making HDFStore recognize non-pandas tables is a non-goal.
This is a disappointing attitude, because this used to work. It's not an enhancement, but a regression that was introduced by an 'optimisation'. Groups and keys aren't a feature of pandas, but a feature of the HDF5 standard. If pandas is going to claim support for the HDF5 file format, then it should support the whole format. This isn't even restricted to the dataframe/tables level. You're basically stating the equivalent of 'supporting files stored in a directory not created by pandas is a non-goal'.
At the dataframe level, I don't understand why you'd not want the ability to load data from a non-pandas table and create a dataframe from it - pandas isn't the only library out there for working with HDF5, and it's safe to assume not all users will be working with data that came out of pandas in the first place (got my hand up here). I do, however, understand why you'd not want to convert back from a dataframe to a 'non-pandas table' - being able to read data from a non-pandas HDF5 file but only write pandas tables makes some kind of sense.
If pandas insists on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not). Just one person's opinion.
EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.
Sorry to jump in here, but importing non-pandas files should work just fine and was not the issue here. It is just no longer possible to list all the keys/groups with standard pandas, which led to downstream problems, most notably with dask. You can have a look at the PR here or at the linked dask issue on how to list all of them with pytables. If you have trouble importing HDF5 files in general, then that may be a separate issue. [...]
Yeah, I cannot understand that either, but it is the developers' decision to cripple HDF5 support down to pandas-specific files.
That would be a really bad decision indeed.
No need to apologise. It is a discussion, after all.
The problem isn't just with dask (I'm not using dask). I am trying to import data from non-pandas HDF5 files, to be put into a dataframe, which would then be exported to pandas HDF5 files. My issue was with not being able to list the keys/groups - I have no control over the format of the data coming in, and I need to 'walk' the groups to figure out what's in the file (the file format has a versioning system in it, so I need to walk the groups to figure out how many there are, and where to pull my specific dataset from) - so I guess you could say I have the same issue as dask. They've implemented a workaround which doesn't use the public API of pandas (by drilling down directly into pytables).
No, importing the data from a pandas file works fine; my issue is with groups/keys also.
The biggest issue I have is the 'we are not adding new methods' statement. I understand pandas wants a clean API, and certainly doesn't want to roll back a change made so long ago. What I can't fathom is why pandas couldn't have a separate raw_keys() and raw_groups() method in addition to the keys() and groups() methods. This wouldn't break the existing API, and would allow us to still walk these groups while keeping this 'optimisation' in place.
What would be a bad decision? I don't understand this last statement. Thanks for your insight.
I believe that the discussion should take place at the issue, not here.
Probably dask may actually be the right tool for you then; it allows wildcards in the key argument. Otherwise you could use plain pytables in a first round to get the keys, and then pandas to read them. I don't think that the issue will be fixed in pandas.
I understand the intention not to change the API. HDF5 is apparently a minor enough part that nobody really bothers. The best thing would have been to revert the change and admit that it was screwed up; that would have saved a lot of developer resources.
Sorry, my comment was in reference to the statement above. I think it is best to continue at the issue, and not fill this pull request further.
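For anyone hitting the same wall, the two-step workaround sketched above (plain PyTables to list the keys, then pandas to read them) might look roughly like this; the file name is made up, and whether read_hdf can ingest a given native table depends on its layout:

import pandas as pd
import tables  # PyTables

path = "external_data.h5"  # an HDF5 file not written by pandas (assumed name)

# Step 1: list the table nodes with plain PyTables, since HDFStore.keys()
# can come back empty for files without pandas metadata.
with tables.open_file(path, mode="r") as h5:
    table_keys = [node._v_pathname for node in h5.walk_nodes("/", classname="Table")]

# Step 2: read each table into a DataFrame with pandas.
frames = {key: pd.read_hdf(path, key=key) for key in table_keys}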
OK, I think I commented elsewhere that if you want to support [...]
Prefer the solution in #32723.
First try to get pandas-style tables; if there are none, return the list of non-pandas (native HDF5) tables in the file (if any)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff